DOCUMENT RESUME



ID ITU 661 



TM 009 537



AUTHOR          Massey, Randy H.; And Others
TITLE           Performance Appraisal Ratings: The Content Issue.
                Final Report, June 1976 through August 1978.
INSTITUTION     Air Force Human Resources Lab., Brooks AFB, Texas.
REPORT NO       AFHRL-TR-78-69
PUB DATE        Dec 78
NOTE            Superintendent of Documents, U.S. Government Printing
                Office, Washington, D.C. 20402 (Stock Number
                671-056/109)

EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     Content Analysis; Evaluation Criteria; Individual
                Characteristics; *Officer Personnel; *Peer
                Evaluation; *Personnel Evaluation; *Rating Scales;
                *Task Performance; Test Items; Test Reliability;
                *Work Attitudes
IDENTIFIERS     Air Force



ABSTRACT

Three kinds of rating statements, trait-oriented,
worker-oriented, and task-oriented, were evaluated in a context
permitting the comparisons to be made in terms of criteria external
to the ratings. One hundred twenty Air Force noncommissioned officers
assigned to seminar groups of 13 or 14 were involved. No evidence of
superiority was found for any of the three sets although significant
correlations with various external criteria were obtained in all
three experimental conditions. Significant differences were also
found among the three rating sub-groups comprising each of the three
treatment groups although these rating sub-groups were assigned
randomly to the three treatment groups. The importance of controlling
for group effects in peer group studies was noted. (Author/KH)



* Reproductions supplied by EDRS are the best that can be made *
* from the original document. *

*********************************************************************** 






AFHRL-TR-78-69 



AIR FORCE 






U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY



By 

Randy H. Massey, Capt, USAF 
Cecil J. Mullins
James A. Earles 



PERSONNEL RESEARCH DIVISION 
Brooks Air Force Base, Texas 78235 



PERFORMANCE APPRAISAL RATINGS:
THE CONTENT ISSUE

HUMAN RESOURCES LABORATORY



December 1978 
Final Report for Period June 1976 - August 1978



Approved for public release; distribution unlimited. 






AIR FORCE SYSTEMS COMMAND 

BROOKS AIR FORCE BASE, TEXAS 78235



NOTICE 



When U.S. Government drawings, specifications, or other data are used
for any purpose other than a definitely related Government
procurement operation, the Government thereby incurs no
responsibility nor any obligation whatsoever, and the fact that the
Government may have formulated, furnished, or in any way supplied
the said drawings, specifications, or other data is not to be regarded by
implication or otherwise, as in any manner licensing the holder or any
other person or corporation, or conveying any rights or permission to
manufacture, use, or sell any patented invention that may in any way
be related thereto.

This final report was submitted by Personnel Research Division, under
project 2313, with HQ Air Force Human Resources Laboratory
(AFSC), Brooks Air Force Base, Texas 78235. Randy H. Massey (PEP)
was the Principal Investigator for the Laboratory.

This report has been reviewed and cleared for open publication and/or
public release by the appropriate Office of Information (OI) in
accordance with AFR 190-17 and DoDD 5230.9. There is no objection
to unlimited distribution of this report to the public at large or by DDC
to the National Technical Information Service (NTIS).

This technical report has been reviewed and is approved for publication.

LELAND D. BROKAW, Technical Director
Personnel Research Division

RONALD W. TERRY, Colonel, USAF
Commander



Unclassified 



REPORT DOCUMENTATION PAGE 


READ INSTRUCTIONS 
BEFORE COMPLETING FORM 


1. REPORT NUMBER
AFHRL-TR-78-69

2. GOVT ACCESSION NO.

3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)
PERFORMANCE APPRAISAL RATINGS: THE CONTENT ISSUE

5. TYPE OF REPORT & PERIOD COVERED
Final
June 1976 - August 1978

6. PERFORMING ORG. REPORT NUMBER

7. AUTHOR(s)
Randy H. Massey
Cecil J. Mullins
James A. Earles

8. CONTRACT OR GRANT NUMBER(s)

9. PERFORMING ORGANIZATION NAME AND ADDRESS
Personnel Research Division
Air Force Human Resources Laboratory
Brooks Air Force Base, Texas 78235

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
61102F
2313T602

11. CONTROLLING OFFICE NAME AND ADDRESS
HQ Air Force Human Resources Laboratory (AFSC)
Brooks Air Force Base, Texas 78235

12. REPORT DATE
December 1978

13. NUMBER OF PAGES
20

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)
Unclassified

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

18. SUPPLEMENTARY NOTES
SM Study Nr. 6728

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
rating statements          task-oriented dimensions
performance ratings        trait-oriented dimensions
peer ratings               worker-oriented dimensions
peer rankings


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Three kinds of rating statements, trait-oriented, worker-oriented, and task-oriented, were evaluated in a
context permitting the comparisons to be made in terms of criteria external to the ratings. One hundred twenty Air
Force noncommissioned officers assigned to seminar groups of 13 or 14 at the Air Training Command NCO
Academy, Lackland AFB Annex, were involved in this experiment.

No evidence of superiority was found for any of the three sets although significant correlations with various
external criteria were obtained in all three experimental conditions. Significant differences were also found among
the three rating sub-groups comprising each of the three treatment groups although these rating sub-groups were
assigned randomly to the three treatment groups. The importance of controlling for group effects in peer group



DD 1 JAN 73 1473    EDITION OF 1 NOV 65 IS OBSOLETE    Unclassified

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)




PREFACE 



This research was conducted under project 2313, Research on Human Factors in
Aero Systems; task 2313T6, Force Acquisition, Assignment, and Evaluation.






TABLE OF CONTENTS

I. Introduction

II. Method
    Sample
    Rating Scales
    Rating Tasks
    Research Approach and Rationale
    Data Analysis

III. Results and Discussion

IV. Summary and Conclusions

References

Appendix A: Rating Dimensions

LIST OF TABLES

Table

1 Analysis of Variance of Number of "Hits" (Correct Profile Identifications)
  by Treatment and Seminar Group

2 Number of Profile Identifications (Hits) by Treatment and Seminar Group

3 Rank Order Correlations Among Unidentified Profile Rankings, Peer Rankings,
  and Official Rank by Treatment and by Seminar Group

4 Analysis of Variance of Squared Deviations between Unidentified Profile
  Rankings and Peer Rankings by Treatment and by Seminar Group

5 Analysis of Variance of Squared Deviations between Unidentified Profile
  Rankings and Official Rankings by Treatment and by Seminar Group

6 Analysis of Variance of Squared Deviations between Peer Rankings and
  Official Rankings by Treatment and by Seminar Group



PERFORMANCE APPRAISAL RATINGS: THE CONTENT ISSUE 



I. INTRODUCTION 

Much research done on ratings has been concerned with efforts to determine the best stimulus 
statements to use in a rating situation. Unfortunately, in much of this research "best" has been defined in 
terms of psychometric properties inherent in the ratings. Little research has been done employing external 
criteria for evaluating rating statements. 

This is one in a series of studies intended to help resolve the content issue of rating statements. This 
study focuses on the relative merits of rating statements with content selected to represent different points 
on a continuum from highly job-specific statements to person-oriented, trait-like statements. A context was 
constructed which provides an opportunity to evaluate the usefulness of various sets of rating statements 
against criteria external to the ratings, rather than the more traditional method of evaluating rating 
statements in terms of their internal psychometric characteristics. 

The generally accepted viewpoint is that the more specific observable behaviors are more accurately 
rated than general personality descriptive statements. This viewpoint appears to be based more on the 
selective appraisal of a narrow spectrum of studies than on an appraisal of all studies conducted in the field 
(Kavanagh, 1971). In any case, the difficulties and controversial issues inherent in ratings have been well 
documented (e.g., Barrett, 1966; Kavanagh, 1971; Ronan & Prien, 1971; Schmidt & Kaplan, 1971).

A popular scaling procedure designed to measure job performance is the Behavioral Expectation 
Scales (BES) developed by Smith and Kendall (1963). In this procedure, the important performance 
dimensions are identified and defined by a group of individuals responsible for evaluations. The scales are 
anchored by actual job behaviors which represent specific performance levels. The BES has had 
considerable intuitive appeal, and there have been many proponents of the technique (e.g., Campbell, 
Dunnette, Arvey, & Hellervik, 1973; Campbell, Dunnette, Lawler, & Weick, 1970; Dunnette, 1966; Landy,
Farr, Saal, & Freytag, 1976; Zedeck & Blood, 1974). BES scales have also been developed for a variety of
occupations (e.g., Arvey & Hoyle, 1974; Landy et al., 1976; Smith & Kendall, 1963). This may account for
the belief that behavior-based rating statements are superior to trait-oriented statements. 

Despite its popularity, a review of studies in which BES was compared to other formats does not 
provide overwhelming support for the BES. Burnaska and Hollmann (1974), in examining the psychometric
characteristics of three different rating scale formats (BES, BES without anchors, and another set of a priori 
dimensions), found no differences among the formats with respect to halo, rater bias, or leniency. They 
concluded that "There is no evidence for superiority of any one format." Other investigators (e.g., 
Dickinson & Tice, 1973; Zedeck & Baker, 1972) have found little advantage in terms of discriminant or 
convergent validity of BES obtained ratings. BES ratings have also been found to be non-transferable within 
the same occupation from the original developed setting to another similar work setting (Borman & Vallon, 
1974). The BES exhibited no superiority over a more simple scale (BES without anchors) on interrater 
agreement and halo effect. In fact, the simpler scale showed significantly less leniency effect (lower 
adjusted mean ratings and greater adjusted standard deviations) than the BES format. In short, the 
literature does not provide overwhelming support for the superiority of BES over other scale formats. 

Other popular methodologies include deriving rating scales based on patterns of job requirements
(McCormick, 1959) and the multitrait-multimethod approach to measuring job performance (Campbell &
Fiske, 1959). McCormick (1959) emphasizes the importance of using job-oriented and worker-oriented
statements primarily derived from job analysis techniques. Job-oriented statements describe the job
content, or what is accomplished by the worker (repair water pump, inspect lubrication system, drive
pickup truck, etc.). Worker-oriented statements tend to characterize generalized human behaviors or worker
characteristics which are usually descriptive across many different jobs (observe visual displays, judge
condition or quality, manually pour ingredients into container, etc.). In the multitrait-multimethod
approach, data from many traits and raters are analyzed for convergent and discriminant validity (Campbell
& Fiske, 1959). The concepts of convergent and discriminant validity, in the context of the Campbell and
Fiske paper, appear to apply primarily in situations where there is no clearly preferred single target or
criterion variable available. Convergent validity is represented by the size of the correlations among data
sets from independent sources, such as separate raters, and discriminant validity is represented by the size
of the correlations among different variables obtained from the same source, such as separate rating
statements from the same rater. Obviously, one prefers the convergent validity correlation coefficients to be
high and the discriminant validity correlation coefficients to be low. In the rating situation, to the degree
that correlation coefficients representing discriminant validities are high, one suspects that a large amount
of halo error is present. The multitrait-multimethod approach offers evidence that traits can be effective in
performance evaluation devices (Kavanagh, 1971; Kelley & Fiske, 1951). The BES and McCormick (1959)
approaches basically assume the superiority of behavior-based or task-oriented type dimensions. There is no
comparative evidence to indicate the superiority of any of the methodologies.
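The two coefficients described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings are invented, the statement names `leadership` and `job_knowledge` are hypothetical, and the ordinary Pearson correlation stands in for whatever coefficient a given study used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings of six ratees (illustrative numbers, not study data).
rater_1 = {"leadership": [1, 2, 3, 4, 5, 6], "job_knowledge": [3, 1, 4, 2, 6, 5]}
rater_2 = {"leadership": [1, 2, 3, 4, 6, 5]}

# Same statement, independent raters: the "convergent" coefficient.
convergent = pearson(rater_1["leadership"], rater_2["leadership"])
# Different statements, same rater: the "discriminant" coefficient.
discriminant = pearson(rater_1["leadership"], rater_1["job_knowledge"])

print(round(convergent, 3), round(discriminant, 3))
```

In this made-up case the convergent coefficient is the larger of the two, which is the preferred pattern; a discriminant coefficient nearly as high as the convergent one would be the sign of halo described in the text.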

A common issue underlying all rating methodological approaches is the "content issue" defined by
Kavanagh (1971) as "the issue of the relative representativeness of traits . . . along a continuum ranging
from subjective to objective, abstract to concrete, or personality to performance." He concluded that there
was no overwhelming evidence to indicate the superiority of behavior-based over trait-oriented dimensions.
He further suggests that contradictory findings across reliability and validity studies could be partially
attributed to a failure to resolve or control for the "content issue." Resolution of this issue may give insight
into the effectiveness of various performance evaluation methodologies, particularly in relation to time and
[...] expended. Settlement of this issue can also have significant explanatory value accounting for the
numerous contradictory findings that exist in performance appraisal research.

Kavanagh, MacKinney, and Wolins (1971) were the first to directly address the content issue using
the multitrait-multimethod approach, by investigating middle managers using performance ratings by
superiors and two subordinates. They found more convergent validity for personal traits than performance
traits but no difference in discriminant validity. Although the higher personal trait convergent validity was
accompanied by a greater degree of "halo," the overall conclusion was that ratings of personal traits did as
well as the ratings of performance traits.

Since Kavanagh (1971), the content issue has been almost entirely ignored. Recently, Borman and
Dunnette (1975) attempted to resolve the content issue by comparing behavior-based statements with
trait-oriented statements. Their conclusions were, "at present little empirical evidence exists supporting the
incremental validity of performance ratings made using behavior scales." Unfortunately, there are
methodological problems associated with their study. They compared three different rating systems
(performance anchored, performance non-anchored, and trait-oriented statements obtained from the Naval
Officer Fitness Report) rather than just comparing three rating formats. In sum, the study did not directly
focus on the content issue of rating criteria, but rather on the effectiveness of three different rating
systems. Among other experimental difficulties, they compared different numbers of rating statements
across treatments and included trait-like statements (integrity, responsibility, and dedication) within the
performance treatment category.

It seems clear, then, that the issue of the preferred content for rating statements has in no way been
resolved by previous research. This study is one in a series of studies using criteria external to the ratings to
attempt such a resolution. It is anticipated that this approach will be more effective in resolving the content
issue than were past studies that employed internal characteristics of the rating instrument as criteria for
judging the excellence of rating statements.




II. METHOD 



Sample 

One hundred twenty students assigned to the Air Training Command (ATC) NCO Academy at 
Lackland AFB Annex completed the rating tasks. The study included nine separate seminar groups, each 
consisting of 13 or 14 noncommissioned officers (E6s to E7s) whose length of military service was 10 to 17 
years. 

Rating Scales 

The treatment conditions in this study varied across three different types of rating statements
(task-oriented, worker-oriented, and trait-oriented). Ten rating statements representing each of the three
different kinds of rating content were included in the study. These were determined by consultation with
instructors, administrative officials, and students. Previously conducted studies were also reviewed to
identify factors. Each of the 10 rating attributes was rated on a 5-point scale as follows:

                                                 Well
              Below                  Above       Above
              Average    Average     Average     Average     Outstanding

Specific
Ratable       _______    _______     _______     _______     _______
Attribute

Trait-oriented attributes also included a brief descriptive definition. See Appendix A for a complete list and
description of the rating statements.

Rating Tasks 

The research was conducted in two phases. In Phase I, each student rated all members in his seminar
group on one and only one of the three different types of statements: task-oriented, worker-oriented, and
trait-oriented. This phase resulted in the generation of individual profiles based on the group's evaluation of
each member on each of the 10 selected rating attributes.

In Phase II, about 2 weeks later, the experimenter handed out the profiles to the seminar group
without an identifying name on the profiles. Each subject was required to perform three tasks: first, he had
to rank-order the profiles according to predicted seminar class rank; second, he had to identify to whom
each profile belonged; and third, he had to predict the final school seminar class rank of his seminar peers
without any regard to profile considerations. Subjects appeared unaware of the nature of the study until
Phase II research when they were asked to identify each of the profiles.

Research Approach and Rationale 

Many studies into the relative efficiency of sets of rating statements have apparently started with a
basic set of assumptions. First, the raters are subject to leniency error resulting in elevated means and to
halo error revealed by small standard deviations among the ratings assigned. Since these two forms of rating
error are revealed by the indicated statistics, a study of means and standard deviations forms a basis for
comparison among sets of rating statements which may be used to distinguish among sets as to their
goodness. Second, if rating statements are meaningful, and if raters are accurate in their perceptions of
ratees, then inter-judge agreement, in the form of correlations among sets of ratings issuing from different
judges, will be an expression of the goodness of a set of ratings. Third, the most useful way to compare sets
of rating statements with each other lies in the comparisons which can be made among the summary
statistics produced by the ratings. If one accepts these assumptions, then it follows that the best way to
compare sets of rating descriptions is as it has frequently been done: the best set is that set which produces
lower means, larger standard deviations, and larger inter-judge correlation coefficients.
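The traditional comparison described above reduces each set of rating statements to three summary statistics. The following is a minimal sketch of that computation, assuming a hypothetical layout in which each judge supplies one list of ratings for the same ratees; the function name `traditional_summary` and the numbers are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def traditional_summary(ratings_by_judge):
    """Mean, standard deviation, and average inter-judge correlation for one rating set."""
    all_ratings = [r for judge in ratings_by_judge for r in judge]
    pair_rs = [
        pearson(a, b)
        for i, a in enumerate(ratings_by_judge)
        for b in ratings_by_judge[i + 1:]
    ]
    return {
        "mean": mean(all_ratings),        # lower is read as less leniency
        "sd": stdev(all_ratings),         # larger is read as less halo
        "interjudge_r": mean(pair_rs),    # larger is read as more agreement
    }

# Two hypothetical judges rating the same five ratees on a 5-point scale.
judges = [[1, 2, 3, 4, 5], [2, 1, 3, 5, 4]]
summary = traditional_summary(judges)
print(summary)
```

Under the traditional logic, the set with the lower mean, larger standard deviation, and larger inter-judge correlation would be preferred; the paragraphs that follow question each of those readings.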






However, the foregoing assumptions are subject to challenge. Taking them in order: 

1. The evidence seems clear that leniency and halo errors do occur. It is less clear how important
these two errors are in a family of other possible errors (e.g., racial bias, low rater motivation, low
observability of the ratee, and others). It is also clear that there is not a direct relationship between leniency
error and larger means or between halo error and smaller standard deviations. A person who is good on one
dimension is more likely also to be good on whatever other dimensions are being considered. This is true
whether the "goodness" metric is derived from ratings, from tests, or from any other reasonable source.
Therefore, some portion of "halo error" may reflect true conditions, and be no error at all.

2. Inter-judge agreement may sometimes be a sufficient basis for comparing sets of rating
statements, but it is not unusual for groups of judges to agree on a decision which additional facts show to
be in error. If one may postulate individual differences among raters in respect to their ability to perceive
ratees accurately, which seems plausible, then one must agree that some raters will provide better ratings.
If some raters are better than others, it seems naive to expect that their ratings of a given characteristic will
fall eternally at the mean of ratings given on that characteristic.

3. In this study, an approach is taken which provides a better basis for making comparisons across
rating sets than does the traditional psychometric comparison. The approach is constructed around the
concept of "hits"; that is, the number of times a rater can correctly identify anonymous profiles of his
peers, constructed around various sets of descriptor statements.

If a rating statement is useful in describing a person, and if a group of raters can agree to some extent
on the elevation of this characteristic in a ratee, then a profile of this ratee produced from a set of such
statements should be identifiable as a rating "picture" of that individual. If a group of raters can recognize
the individuals whom their profiles describe, then it seems more likely that the set of profiled
characteristics can be useful in evaluating or predicting the performance of those individuals. The number
of "hits" (correctly labeled profiles) should be useful in comparing one set of rating descriptions with
another.
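Counting hits amounts to comparing a rater's guessed labels with the true owners of the anonymous profiles. A minimal sketch follows, with hypothetical profile codes (`P1` to `P4`) and peer names that are not taken from the study.

```python
def count_hits(guesses, truth):
    """Number of anonymous profiles that the rater labeled with the correct peer."""
    return sum(1 for profile, name in guesses.items() if truth[profile] == name)

# Hypothetical key: which peer each anonymous profile actually describes.
truth = {"P1": "A", "P2": "B", "P3": "C", "P4": "D"}
# One rater's guesses; profiles P2 and P3 are swapped.
guesses = {"P1": "A", "P2": "C", "P3": "B", "P4": "D"}

hits = count_hits(guesses, truth)
print(hits)
```

A swap of two similar profiles costs the rater two hits here, which is the kind of all-or-nothing behavior that motivates the rank-order measure discussed next.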

One analysis was made using hits as the dependent variable. The number of hits, however, at least in
prior research (Curton, Ratliff, & Mullins, 1977), has proved so small that something more sensitive was
needed. A rater could conceivably misidentify the first profile considered; and that misidentification could
cause him to miss the rest, even if only by a small margin, or he could be so insensitive to personal
differences that he makes gross errors in all the identifications. The search for a more sensitive measure of
identification led to the use of the rank-order correlation as a possibly more effective measure of
identification of peers than the simple count of correct identifications.

If a rater trying to identify anonymous profiles of his peers is confronted with 15 profiles, three of
which have been rated very high on a particular characteristic, and if he believes correctly that peers B, H,
and J are the three in his peer group highest on this characteristic, he may not know which of the three is
peer B. He might specifically misidentify all three profiles, even though he has been correct in believing
that these three profiles, as a set, represent peers B, H, and J. Although he has been correct, his number of
exact identifications, or hits, among these three profiles would be zero, no better than it would be for a
less astute rater who believed B, H, and J were the lowest three in the peer group on that characteristic. In
short, the "hits" measure contains no provision for crediting near misses, but the correlation between the
ranking of unidentified profiles and the ranking of his named peers on the success dimensions should
provide a continuum which the raw "hits" metric does not possess. A rank-order correlation between these
two ranks should provide a sensitive measure of recognition far more powerful than the simple count of
matched profiles.
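The B, H, and J example can be made concrete with a small sketch. It assumes the standard no-ties Spearman formula and uses a hypothetical six-person group (the study's groups had 13 or 14 members); the rank values are invented for illustration.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two untied rankings of the same people."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

def exact_matches(rank_a, rank_b):
    """Count of people given exactly the same rank in both rankings."""
    return sum(a == b for a, b in zip(rank_a, rank_b))

# Hypothetical peers in order B, H, J, K, L, M.
named_rank = [1, 2, 3, 4, 5, 6]        # ranking of named peers
astute_profiles = [2, 3, 1, 4, 5, 6]   # B, H, J kept on top but shuffled
poor_profiles = [5, 6, 4, 1, 2, 3]     # B, H, J placed at the bottom

# Both raters have zero exact agreement on B, H, and J ...
top3_astute = exact_matches(named_rank[:3], astute_profiles[:3])
top3_poor = exact_matches(named_rank[:3], poor_profiles[:3])
# ... but the rank-order correlation separates them clearly.
rho_astute = spearman_rho(named_rank, astute_profiles)
rho_poor = spearman_rho(named_rank, poor_profiles)

print(top3_astute, top3_poor, round(rho_astute, 3), round(rho_poor, 3))
```

The exact-match count on the three top peers cannot tell the two raters apart, while rho credits the first rater's near misses, which is the continuum the text attributes to the rank-order measure.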

Data Analysis 

In order to apply the metric described in the preceding paragraph, three rankings were collected.
First, an official ranking (OR) of the students, performed by the school, was available. Second, a ranking of
the anonymous profiles (UP) was collected. Finally, a ranking of seminar members by their peers (PR) was
collected. This ranking was made using only a list of peer names, not profiles, and was made according to
predictions of success in training.

The UP and PR rankings were group average ranks derived by summing all of the assigned ranks for
each person in his seminar group, then converting that total sum of ranks back to a rank order ranging from
1 to 13 or 14 depending on the seminar's group size. These average ranks, UP and PR, represented a group
consensus on the perception of each seminar member by the group. The Official Class Rank (OR) was
determined by class standing on four exams (312 points), drill evaluation (25 points), student evaluation
(25 points), and communication skills (38 points).
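The conversion from individual rankings to a group consensus rank can be sketched as follows. The tie handling is an assumption: the report does not say how equal rank sums were resolved, so this sketch simply keeps the order in which names first appear. The rater data are hypothetical.

```python
def consensus_rank(rankings):
    """Sum each person's assigned ranks across raters, then re-rank the sums from 1 upward."""
    totals = {}
    for ranking in rankings:
        for name, rank in ranking.items():
            totals[name] = totals.get(name, 0) + rank
    # Assumption: ties in the rank sums keep first-seen order (sorted() is stable).
    ordered = sorted(totals, key=lambda name: totals[name])
    return {name: position + 1 for position, name in enumerate(ordered)}

# Three hypothetical raters ranking three seminar members.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
group_rank = consensus_rank(rankings)
print(group_rank)

# Point components of the official class rank as listed in the text.
official_points = {"exams": 312, "drill": 25, "student_evaluation": 25, "communication": 38}
total_points = sum(official_points.values())
print(total_points)
```

The rank sums here are 4, 6, and 8, so the consensus order is A, B, C; the official components listed in the text add up to 400 points.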

Rank-order correlations for each rater were computed for the following purposes:

1. Correlation between unidentified profile ranking and named peer rankings (UP-PR): One
correlation coefficient was computed for each rater and was viewed as a more sensitive measure of hits than
is the number of exact identifications of unlabeled profiles. This produced a new variable, the logic of
which was explained above.

2. Correlation between unidentified profile rankings and official class rank (UP-OR): One
correlation coefficient for each rater. This variable indicates how well the rater can evaluate the operational
criterion (OR) in terms of the statements available. Differences in effectiveness among the statement sets
should be revealed in differences between the sizes of the average correlation coefficients. Average
correlation coefficients across groups could have been computed by summing the numerators in the rho
formula ($6\sum d^2$) and dividing by the sum of the denominators [$N(N^2 - 1)$]. The squared deviations ($d^2$)
were used in the analysis of variance (ANOVA) since in this instance they provided a simpler and more
accurate measurement variable in examining rank order effect than did the correlation coefficients
themselves.

3. Correlation between named peer rankings and official class rank (PR-OR): One for each rater. The
average of this correlation coefficient would normally indicate the efficiency of peer ratings in predicting a
criterion. In this case, however, there was considerable evidence that most of the subjects were well aware
through intra-group discussion of how their peers had done on previous tests and were consequently aware
of how they stood on the overall class evaluation. In short, they were ranking on direct information about
their peers rather than on judgment based on indirect knowledge.
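For reference, the rho formula mentioned in item 2 and the pooled average it describes can be written out as below. This is a sketch under stated assumptions: it uses the usual no-ties Spearman formula, the group index $g$ is introduced here for illustration, and subtracting the pooled ratio from 1 is an assumption, since the text mentions only the summed numerators and denominators.

```latex
\rho = 1 - \frac{6\sum_{i=1}^{N} d_i^{2}}{N(N^{2} - 1)},
\qquad
\bar{\rho} = 1 - \frac{\sum_{g} 6\sum_{i} d_{gi}^{2}}{\sum_{g} N_g\,(N_g^{2} - 1)}
```

With the seminar sizes used in this study, the denominator is $13(13^2 - 1) = 2184$ for a group of 13 and $14(14^2 - 1) = 2730$ for a group of 14, so larger groups carry more weight in the pooled value.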

The primary analysis included testing to see if significant differences existed in terms of hits and the 
other dependent variables among the three treatment conditions. Since each seminar group was randomly 
assigned to one of the three treatment conditions, the experimental design resulted in the nesting of three
seminar groups under each treatment condition. The hierarchical design (Nested Factors) is usually used to
test the effects among a number of treatments in certain types of experimental situations (Winer, 1962). 
Typical examples include investigating drug effects among a number of hospitals, studying teaching 
methods among a number of schools, or studying training methods among different individuals. 

The hierarchical ANOVA is an efficient method of studying such experimental situations because it 
avoids multiple t-tests or non-orthogonal comparisons (Hays, 1963). The two-way hierarchical ANOVA in 
this experiment is also a more powerful statistical test than a one-way ANOVA that only tests for treatment 
effects, ignoring any group effects. In this design, the nested factors are controlled by statistical procedures. 
In many experimental situations, it is dangerous to assume that certain nested factors have no significant 
influence on treatment effects. 

Two sources of variation were observed in the experimental data. The treatment effect was of 
primary interest, whereas the seminar group affiliation was of secondary interest. The null hypothesis, i.e., 
no difference between treatment means, was tested for both investigated sources of variation. The analysis 
of both sources of variation was accomplished by performing a two-way hierarchical ANOVA for 
experiments with unequal cell sizes, using the least-squares procedural method described by Timm and 
Carlson (1975). 
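The structure of a two-way hierarchical (nested) ANOVA can be sketched with a simple weighted-means decomposition. This is not the Timm and Carlson least-squares procedure used in the study; it is a simplified illustration, the function name `nested_anova` and the toy scores are hypothetical, and the choice to test treatments against the groups-within-treatments mean square is an assumption of the sketch.

```python
def nested_anova(data):
    """Sums of squares and F ratios for groups nested within treatments.

    data maps treatment -> {group -> list of scores}.
    """
    scores_all = [x for groups in data.values() for s in groups.values() for x in s]
    grand_mean = sum(scores_all) / len(scores_all)
    ss_treat = ss_groups = ss_error = 0.0
    n_groups = 0
    for groups in data.values():
        t_scores = [x for s in groups.values() for x in s]
        t_mean = sum(t_scores) / len(t_scores)
        ss_treat += len(t_scores) * (t_mean - grand_mean) ** 2
        for s in groups.values():
            g_mean = sum(s) / len(s)
            ss_groups += len(s) * (g_mean - t_mean) ** 2
            ss_error += sum((x - g_mean) ** 2 for x in s)
            n_groups += 1
    df = (len(data) - 1, n_groups - len(data), len(scores_all) - n_groups)
    ms_treat, ms_groups, ms_error = ss_treat / df[0], ss_groups / df[1], ss_error / df[2]
    return {
        "df": df,
        "F_treatment": ms_treat / ms_groups,  # treatments tested against nested groups
        "F_groups": ms_groups / ms_error,     # groups tested against within-group error
    }

# Toy data: two treatments, two groups each, two scores per group.
data = {
    "T1": {"g1": [1, 3], "g2": [3, 5]},
    "T2": {"g3": [4, 6], "g4": [6, 8]},
}
result = nested_anova(data)
print(result)
```

For a design with three treatments, nine seminar groups, and 120 raters, the same degrees-of-freedom logic gives 2, 6, and 111.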

"Hits" and the sum of the squared differences between UP and PR rankings, UP and OR rankings, 
and PR and OR rankings were the dependent variables used in the ANOVA analysis to determine whether
significant differences existed among treatment conditions. The squared differences between rank orderings
were used rather than rank order correlations since the squared differences provided a simpler and more
accurate measurement variable in examining rank order similarity.

III. RESULTS AND DISCUSSION 

The hierarchical ANOVA summary for "hits," or correct identification of profiles, is shown in Table
1. As expected, the "hits" measurement variable showed no significant differences among treatments. In
essence, the ratings for each individual produced by the three different sets of rating statements
were equal in their descriptive power. However, seminar group effects within treatments were significant at
the .01 level (Table 1). Table 2 shows the summary results of hits for seminar groups within treatments.

Table 1. Analysis of Variance of Number of "Hits" (Correct Profile
Identifications) by Treatment and Seminar Group

Source                               Sum of Squares     df     Mean Square        F

Treatment                                  6.215          2        3.107         .349
Seminar Groups Within Treatments          53.421          6        8.903        3.247*
Error (Within Groups)                    304.379        111        2.742

*Significant at .01 level.
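The mean squares and F ratios in Table 1 can be rechecked from the sums of squares and degrees of freedom. The short sketch below does only that arithmetic; the reading that the treatment F uses the seminar-groups mean square as its denominator is an inference from the reported values, not a statement made in the text.

```python
# Sums of squares and degrees of freedom as reported in Table 1.
ss = {"treatment": 6.215, "groups": 53.421, "error": 304.379}
df = {"treatment": 2, "groups": 6, "error": 111}

# Mean square = sum of squares / degrees of freedom.
ms = {source: ss[source] / df[source] for source in ss}

# Inferred denominators: treatments against nested groups, groups against error.
f_treatment = ms["treatment"] / ms["groups"]
f_groups = ms["groups"] / ms["error"]

print({k: round(v, 3) for k, v in ms.items()}, round(f_treatment, 3), round(f_groups, 3))
```

The seminar-groups mean square works out to 53.421 / 6, or about 8.903, and the two ratios agree with the reported F values of .349 and 3.247 to three decimals.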



Table 2. Number of Profile Idcntificatioiw (Hits) by Treatment 
and ' y Seminar Group 



Tr«atm«fit 1 
(Seminar Croup) 



Traatmant 2 



TrMtmant 3 
(Samtnar Group) 



c 


e 


H 


B 


D 


o 


14 


13 


13 


13 


13 


14 


32 


26 


48 


23 


29 


42 


2.29 


2.00 


3.69 


1.77 


2.23 


3.00 


1.90 


1.68 


1.55 


1 JO 


uo 


.96 




40 






40 






106 






94 






2^5 






2J5 






\S3 






1.27 





13 

24 
1.86 
1.63 



14 

47 
3.36 
1.82 

40 

116 
2>J0 
2. 05 



13 
45 
3.46 
2J7 



Group 
Total N 
Total HiU 
Mean Hits 
SD Hits 
Traitment 

Total N 
Total Hits 
Mean HiU 
SD nts 

T-Ratios 

Treatments 1 vs. 2 Comparison 
Treatments 1 vs. 3 Comparison 
T reatments 2 vs. 3 Comparison 

notslgniricant. 

The .verage rank-order corrda.ior,s betweer, .he pairs of rankings appear in ^ablc 3^U»J^ J'^o^ 
HQfift^ t.hU. of^«niricance for Spearman rhos. 25 of the possible 27 rhos were significant at the .05 level. 
P H^Jr^o™ m^Hf Se rine collations possible each treatment group we« significant at the .01 lc«l 
'rtZ »d «V cTre^tion in eac^f trcatmenU 11 «,d III was not signiHcant. All correlations 



t = .574"' 
t= 1.44"' 
t= .85 



ns 



10 



Table 3. Rank Order Correlations Among Unidentified Profile Rankings,
Peer Rankings, and Official Rank by Treatment and by Seminar Group

                                            Treatments
                    I (Worker)              II (Task)              III (Trait)
Rank Order        Seminar Groups          Seminar Groups          Seminar Groups
Comparisons      [?]    [?]     A         C      E      H         B      D      G

UP and PR       .58*   .86**  .87**     .85**  .86**  .90**     .79**  .90**  .71**
UP and OR       .52*   .7?**  .82**     .43    .65*   .85**     .37    .72**  .70**
PR and OR       .87**  .93**  .97**     .57*   .79**  .94**     .74**  .79**  .97**

Total N          13     14     13        14     13     13        13     13     14

Note. Critical values of rho, the Spearman rank correlation, were obtained from Ferguson (1966), Table G, p. 414.

*Significant at .05 level.
**Significant at .01 level.

demonstrated a similar pattern of significance in each of the three treatment conditions. The three
rank-order comparisons showed a high degree of agreement. This data analysis suggested that no one type of
rating statement was superior for use in performance appraisal instruments. The purpose of these rank-order
comparisons was to see whether the pattern of significance under each treatment was generally similar or
different. However, the most definitive test for determining differences between treatments was the
hierarchical ANOVA.

Tables 4 to 6 show the hierarchical ANOVA summary for comparison of the rating statement
treatment conditions with respect to the squared difference between the following rank-order comparisons:
UP-PR, UP-OR, and PR-OR. The ANOVA results showed no significant difference between treatment
conditions as reflected by the squared differences between the UP-PR rankings (viewed as a more sensitive
measure of identification of unlabeled profiles), the UP-OR rankings (which indicate how well the rater can
evaluate the operational criterion in terms of given stimulus statements), and the PR-OR rankings (normally
indicating the efficiency of peer ratings in predicting a criterion).
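
The squared difference between two rankings, used above as the dependent measure, is also the quantity that enters the Spearman rank correlation through rho = 1 - 6(sum of d squared) / (n(n squared - 1)). The sketch below illustrates both computations; it is not the authors' program. The function names and the five-person rankings are hypothetical, and the formula as written assumes no tied ranks.

```python
def squared_rank_differences(ranks_a, ranks_b):
    """Per-person squared difference between two rankings of the same people."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same people")
    return [(a - b) ** 2 for a, b in zip(ranks_a, ranks_b)]


def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation via the sum of squared rank differences.

    Valid as written only when neither ranking contains ties.
    """
    n = len(ranks_a)
    sum_d2 = sum(squared_rank_differences(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical rankings of five peers (illustrative only, not study data).
peer_rank = [1, 2, 3, 4, 5]
official_rank = [2, 1, 3, 5, 4]

print(squared_rank_differences(peer_rank, official_rank))
print(round(spearman_rho(peer_rank, official_rank), 3))
```

Smaller squared differences mean closer agreement between the two rankings, so a treatment that produced better rankings would show lower mean squared differences in the ANOVA.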



Table 4. Analysis of Variance of Squared Deviations between Unidentified
Profile Rankings and Peer Rankings by Treatment and by Seminar Group

Source                              Sum of Squares    df    Mean Square        F

Treatment                                 9396.114     2       4698.057     .117
Seminar Groups Within Treatments        241470.876     6      40245.146    4.722*
Error (Within Groups)                   945985.099   111       8522.388

*Significant at .01 level.



Table 5. Analysis of Variance of Squared Deviations between Unidentified
Profile Rankings and Official Rankings by Treatment and by Seminar Group

Source                              Sum of Squares    df    Mean Square        F

Treatment                                12127.327     2       6063.663    .0922
Seminar Groups Within Treatments        394330.700     6      65721.783   13.051*
Error (Within Groups)                   558976.730   111       5035.836

*Significant at .01 level.




Table 6. Analysis of Variance of Squared Deviations between Peer Rankings
and Official Rankings by Treatment and by Seminar Group

Source                              Sum of Squares    df    Mean Square        F

Treatments                              119263.060     2      59631.530     .769
Seminar Groups Within Treatments        465196.015     6      77532.668   16.160*
Error (Within Groups)                   532553.566   111       4797.780

*Significant at .01 level.

The PR-OR rank-order coefficient, however, cannot be considered an unbiased indicator, since there
was considerable evidence that most subjects were ranking on information based on knowledge of test
performance acquired through intra-group association, rather than judgment based solely on observation of
peer activities and traits.

Although no significant rank-order differences were found between treatment conditions, as reflected
by the squared differences of the various pairs of rankings, the differences between groups within
treatments on all three ANOVA analyses were significant at the .01 level (Tables 4, 5, and 6). This was an
unexpected result because each seminar group was randomly assigned to one of the three treatment
conditions. The results demonstrated that no one type of content rating statement was superior to any
other in determining rank-order differences.

The data analyses showed that the statements investigated here yielded no significant advantages for
one set of statements over another. It makes no difference whether the rating statements are
task-oriented, worker-oriented, or trait-oriented. This study provides additional evidence that the doubts of
Bell, Hoff, and Hoyt (1963), Borman and Dunnette (1975), and Kavanagh, MacKinney, and Wollins (1971) about the
superiority of job-oriented dimensions over trait-oriented dimensions were well founded. As Kavanagh
(1971) concluded from his comprehensive literature review of performance appraisal studies, there is no
reason to assume the superiority of job-oriented statements over trait-oriented statements. The choice of
rating statements for inclusion in performance appraisal devices should primarily be based on other
considerations. Cost considerations tend to favor trait-oriented statements in most situations, since the job
analysis required to obtain task-oriented and worker-oriented statements is costly and time consuming.
Moreover, trait-oriented statements are also more generalizable across different occupations than are either
task-oriented or worker-oriented statements.

Unlike many prior studies, this study does not conclude with a condemnation of judgmental rating
statements. This study suggests that peer group person-oriented statements are as effective as job descriptive
statements when the standard is an external criterion, such as the ability to recognize peers from
unidentified profiles or the ability to predict their official class rank.

An unexpected finding was the significant effect associated with seminar groups on the
ANOVAs, particularly since all seminar groups were randomly assigned to the treatment conditions. The
importance of recognizing and controlling for group effects in such performance evaluation studies is
evident. Investigated treatment variables might easily become contaminated by group effects, leading to
inaccurate results and conclusions. The reasons for these significant group effects are unknown, although
such intra-group variables as morale, leadership, and attitude are possible causal influences.

It may be that performance appraisal research emphasis has not been placed on the most important
variables. Perhaps there are environmental influences that affect performance ratings more than variables
attributable to the appraisal device. Perhaps such issues as content, format, scale, etc., are relatively
unimportant as compared to these other variables. A need also exists to broaden the research focus in
performance appraisal studies by focusing on criteria independent of and external to the performance appraisal
device.






IV. SUMMARY AND CONCLUSIONS 



Three different kinds of rating stimulus statements, differing along a dimension of trait-oriented to 
task-oriented descriptions, were compared in a context which permitted the comparisons to be made in 
terms of criteria external to the ratings. No evidence of superiority was found for any of the three sets, 
although many significant correlations with various external criteria were obtained in all three experimental 
conditions. 

Significant differences were also found among the three rating sub-groups comprising each of the 
three treatment groups although these rating sub-groups were assigned randomly to the three treatment 
groups. The importance of controlling for group effects in peer group studies was noted. 

REFERENCES

Arvey, R.D., & Hoyle, J.C. A Guttman approach to the development of behaviorally based rating scales for
systems analysts and programmer/analysts. Journal of Applied Psychology, 1974, 59, 61-68.

Barrett, R.S. Performance rating. Chicago: Science Research Associates, 1966.

Bell, F.O., Hoff, A.L., & Hoyt, K.B. A comparison of three approaches to criterion measurement. Journal
of Applied Psychology, 1963, 47, 416-418.

Borman, W.C., & Dunnette, M.D. Behavior-based versus trait-oriented performance ratings: An empirical
study. Journal of Applied Psychology, 1975, 60, 561-565.

Borman, W.C., & Vallon, W.R. A view of what can happen when behavioral expectation scales are
developed in one setting and used in another. Journal of Applied Psychology, 1974, 59, 197-206.

Burnaska, R.F., & Hollmann, T.D. An empirical comparison of the relative effects of rater response biases
on three rating scale formats. Journal of Applied Psychology, 1974, 59, 307-312.

Campbell, D.T., & Fiske, D.W. Convergent and discriminant validation by the multitrait-multimethod
matrix. Psychological Bulletin, 1959, 56, 81-105.

Campbell, J.P., Dunnette, M.D., Arvey, R.D., & Hellervik, L.V. The development and evaluation of
behaviorally based rating scales. Journal of Applied Psychology, 1973, 57, 15-22.

Campbell, J.P., Dunnette, M.D., Lawler, E.E., & Weick, K.E. Managerial behavior, performance, and
effectiveness. New York: McGraw-Hill, 1970.

Curton, E.D., Ratliff, F.R., & Mullins, C.J. Content analysis of rating criteria. Proceedings of Symposium
on Criterion Development for Job Performance Evaluation, 23-24 June 1977.

Dickinson, T.L., & Tice, T.E. A multitrait-multimethod analysis of scales developed by retranslation.
Organizational Behavior and Human Performance, 1973, 9, 421-438.

Dunnette, M.D. Personnel selection and placement. Belmont, CA: Wadsworth, 1966.

Ferguson, G.A. Statistical analysis in psychology and education. New York: McGraw-Hill, 1966.

Hays, W.L. Statistics for psychologists. New York: Holt, Rinehart, and Winston, 1963.

Kavanagh, M.J. The content issue in performance appraisal: A review. Personnel Psychology, 1971, 24,
653-668.

Kavanagh, M.J., MacKinney, A.C., & Wollins, L. Issues in managerial performance: Multitrait-multimethod
analysis of ratings. Psychological Bulletin, 1971, 75, 34-49.

Kelley, E.L., & Fiske, D.W. The prediction of performance in clinical psychology. Ann Arbor: University of
Michigan Press, 1951.

Landy, F.J., Farr, J.L., Saal, F.E., & Freytag, W.R. Behaviorally anchored scales for rating the performance
of police officers. Journal of Applied Psychology, 1976, 61, 750-758.



Century Crofts, 1971.

Schmidt, F.L., & Kaplan, L.B. Composite versus multiple criteria: A review and a resolution of the
controversy. Personnel Psychology, 1971, 24, 419-484.

Smith, P.C., & Kendall, L.M. Retranslation of expectations: An approach to the construction of unambiguous
anchors for rating scales. Journal of Applied Psychology, 1963, 47, 149-155.

Timm, N.H., & Carlson, J.E. Analysis of variance through full rank models. Multivariate Behavioral
Research Monograph No. 75-1. Published by the Society of Multivariate Experimental Psychology,
1975.

Winer, B.J. Statistical principles in experimental design. New York: McGraw-Hill, 1962.

Zedeck, S., & Baker, H.T. Nursing performance as measured by behavioral expectation scales: A
multitrait-multirater analysis. Organizational Behavior and Human Performance, 1972, 7, 457-466.

Zedeck, S., & Blood, M.R. Foundations of behavioral science research in organizations. Monterey, CA:
Brooks/Cole, 1974.






APPENDIX A: RATING DIMENSIONS






WORKER-ORIENTED RATING DIMENSIONS

                                                                    Well
                                         Below             Above    Above    Out-
                                         Average  Average  Average  Average  standing

 1. Military appearance ................   (A)      (B)      (C)      (D)      (E)

 2. Participates in class activities ...   (A)      (B)      (C)      (D)      (E)

 3. Communicates clearly by oral and
    written methods ....................   (A)      (B)      (C)      (D)      (E)

 4. Amount of assistance to peers in
    work assignments ...................   (A)      (B)      (C)      (D)      (E)

 5. Completes work in a timely manner ..   (A)      (B)      (C)      (D)      (E)

 6. Follows provided instructions ......   (A)      (B)      (C)      (D)      (E)

 7. Takes accurate notes ...............   (A)      (B)      (C)      (D)      (E)

 8. Competence in analyzing work
    assignments ........................   (A)      (B)      (C)      (D)      (E)

 9. Awareness of safety precautions ....   (A)      (B)      (C)      (D)      (E)

10. Studies well on his own ............   (A)      (B)      (C)      (D)      (E)




TASK-ORIENTED RATING DIMENSIONS

                                                                             Well
                                         Below                   Above       Above       Out-
                                         Average     Average     Average     Average     standing
                                         Effective-  Effective-  Effective-  Effective-  Effective-
                                         ness        ness        ness        ness        ness

 1. Knows UCMJ programmed text .........   (A)         (B)         (C)         (D)         (E)

 2. Contributes examples in seminar on
    Discipline and Unity of Command ....   (A)         (B)         (C)         (D)         (E)

 3. Promotes and organizes Community
    Project ............................   (A)         (B)         (C)         (D)         (E)

 4. Analyzes courts-martial case study .   (A)         (B)         (C)         (D)         (E)

 5. Participates in Foreign Policy role
    playing ............................   (A)         (B)         (C)         (D)         (E)

 6. Understands reasons for nonalignment
    of uncommitted nations .............   (A)         (B)         (C)         (D)         (E)

 7. Knows history of AF uniform ........   (A)         (B)         (C)         (D)         (E)

 8. Applies the six-step approach to
    problem solving ....................   (A)         (B)         (C)         (D)         (E)

 9. Knows how to plan a conference .....   (A)         (B)         (C)         (D)         (E)

10. Researches topic for Persuasive
    Speech .............................   (A)         (B)         (C)         (D)         (E)






TRAIT-ORIENTED RATING DIMENSIONS

                                                                    Well
                                         Below             Above    Above    Out-
                                         Average  Average  Average  Average  standing

 1. Honesty - straightforward and
    truthful in dealing with others ....   (A)      (B)      (C)      (D)      (E)

 2. Ambition - works hard, accepts
    challenges .........................   (A)      (B)      (C)      (D)      (E)

 3. Dependability - does assigned tasks
    conscientiously without close
    supervision ........................   (A)      (B)      (C)      (D)      (E)

 4. Punctuality - prompt in keeping
    engagements ........................   (A)      (B)      (C)      (D)      (E)

 5. Quality of work - performs work
    accurately and effectively .........   (A)      (B)      (C)      (D)      (E)

 6. Quantity of work - produces a large
    amount of work that meets
    requirement standards ..............   (A)      (B)      (C)      (D)      (E)

 7. Initiative - originates [remainder
    missing in source] .................   (A)      (B)      (C)      (D)      (E)

 8. Adaptability - changes attitude and
    behavior to meet the demands of
    the situation ......................   (A)      (B)      (C)      (D)      (E)

 9. Originality - creative, thinks of
    new solutions to old problems ......   (A)      (B)      (C)      (D)      (E)

10. Agreeableness - gets along well
    with fellow workers, well liked ....   (A)      (B)      (C)      (D)      (E)



U.S. GOVERNMENT PRINTING OFFICE: 1979-671-056/109



